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Abstract 

We construct a locally-lexicographic SSOR preconditioner to accelerate the parallel 
iterative solution of linear systems of equations for two improved discretizations of 
lattice fermions: (i) the Sheikholeslami-Wohlert scheme where a non-constant block- 
diagonal term is added to the Wilson fermion matrix and (ii) renormalization group 
improved actions which incorporate couplings beyond nearest neighbors of the lat- 
tice fermion fields. In case (i) we find the block //-SSOR-scheme to be more effective 
by a factor ~ 2 than odd-even preconditioned solvers in terms of convergence rates, 
at /3 = 6.0. For type {ii) actions, we show that our preconditioner accelerates the 
iterative solution of a linear system of hypercube fermions by a factor of 3 to 4. 

Key words: lattice QCD, improved actions, perfect actions, hypercube fermions, 
SSOR preconditioning 



1 Introduction 

Traditionally, simulations of lattice quantum chromo dynamics (QCD) were 
based on nearest-neighbor finite difference approximations of the derivatives of 
classical fields. It is a general observation made in both quenched and full QCD 
that results from lattices with a resolution > 0.1 fm suffer from considerable 
discretization errors, see e.g. Ref. [1]. Even optimistic estimates expect an 
increase in costs of full QCD simulations oc as the lattice spacing a is 
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decreased [2].Q. 



The Wilson fermion formulation is appropriate with respect to flavor symme- 
try, but plagued by discretization errors of 0{a). These effects have been found 
to be sizeable e.g. in a compilation of quenched world data for quark masses 
[4] and in the determination of the renormalized quark mass, exploiting the 
PCAC relation in the Schrodinger functional [5,6]. 

In lattice QCD, the extraction of physical continuum results requires to ap- 
proach two limits: the chiral limit defined as the point in parameter space 
where the pion mass vanishes, and the continuum limit defined by vanish- 
ing lattice spacing a. The chiral limit amounts to an increase of the inverse 
pion mass (the correlation length C,n) in lattice units towards infinity. The lat- 
tice volume, i.e. the number of sites, must be increased accordingly, in order 
to control finite size effects. At this point, simulations of QCD with dynam- 
ical fermions encounter the problem of solving the fermionic linear system 
MX = 0, where M is the fermion matrix — a compute intensive task. 

The second step, moving towards the continuum limit, requires to decrease the 
lattice spacing a. The two issues are related: if one would be able to get reliable 
results at larger lattice spacing, one could avoid dealing with prohibitively fine 
physical lattice resolutions on large physical volumes. On the classical level, 
one might just use higher order derivative terms in the fermion action in order 
to push finite-a-effects to higher orders. But quantum effects will largely spoil 
the intended gains. 

At present, there are two major trends to improve the fermionic discretiza- 
tion. One approach follows Symanzik's on-shell improvement program [10]. 
Irrelevant (dimension 5) counter terms are added to both, lattice action and 
composite operators in order to avoid 0{a) artifacts. A particularly simple 
and hence preferred scheme is the Sheikholeslami and Wohlert action (SWA) 
[9], where the Wilson action is modified by adding a local term, the so-called 
clover term. Hereby, the amount of storage is doubled. The clover term is suf- 
ficient, in principle, to cancel the 0{a) errors in the action, provided that a 
constant csw is tuned suitably. The hope is to reach the continuum limit for 
a given scaling quantity Q as Q{a) = Q continuum + C'(o^), i-e. without 0{a) 
contamination. The development of a non-perturbative tuning procedure has 
been documented in a series of papers, see Refs. [5,6,11,12]. 

Another promising ansatz is based on Wilson's renormalization group [13]. 
It goes under the name perfect actions. Perfect lattice actions are located on 
renormalized trajectories in parameter space that intersect the critical surface 
(at infinite correlation length) in a fixed point of a renormalization group 

^ In extreme cases, in particular in QCD thermodynamics, the costs rise oc a~^^ 
[3]. 
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transformation. By definition, perfect actions are free of any cut-off effects, 
therefore they represent continuum physics at any lattice spacing a. 

In practice, perfect actions can best be approximated for asymptotically free 
theories starting from fixed point actions. Such fixed point actions can be 
identified in a multi-grid procedure solely by minimization, without performing 
functional integrals. Thus, the task reduces to a classical field theory problem. 
The fixed point action then serves as an approximation to a perfect action 
at finite correlation length; this is a so-called classically perfect action [14]. 
However, even in this approximation the couplings usually extend over an 
infinite range, so for practical purposes a truncation to short distances is 
unavoidable. In such schemes of 'truncated perfect actions' (TPA) one is forced 
to give up part of the original quest for perfectness, for reasons of practicability. 

It goes without saying that SWA and TPA can prove their full utility only after 
combination with state-of-the-art solvers in actual parallel implementations. 
In the recent years, the inversion of the standard Wilson fermion matrix could 
be accelerated considerably by use of the BiCGStab algorithm [7] and novel 
parallel /ocally- lexicographic symmetric successive over-relaxation (Z/-SSOR) 
preconditioning techniques [8]. 

We start from the hypercube fermion (HF) approximation formulated for free 
fermions in Ref. [15]. For an alternative variant, see Ref. [17]. In order to 
meet the topological structure of TPA, we shall follow a bottom-up approach 
by adding interactions to the free fermion case through hyper-links within 
the unit-hypercubc. This results in 40 independent hyper-links per site and 
amounts to a storage requirement five times as large as in SWA. 

In general, the fermion matrix for both SWA and TPA can be written in the 
generic form 

M = D + A + B + --- , (1) 

where D represents diagonal blocks (containing 12 x 12 sub-blocks), A is 
a nearest-neighbor hopping term, while B contains next-to-nearest-neighbor 
couplings, and so on. 

The key point is that one can include into the //-SSOR process (i) the internal 
degrees of freedom of the block diagonal term D as arising in SWA and (ii) 
2-space, 3-space, and 4-space hyper-links, as present in a TPA like HF. 

After reviewing the fermionic matrices for SWA and HF in section 2, we intro- 
duce locally lexicographic over-relaxation (//-SSOR) preconditioning of SWA 
in section 3. We discuss three variants for the diagonal blocks to be used 
in block SSOR. In section 4, we shall parallelize block SSOR within an ex- 
tended /Z-SSOR-scheme and we shall discuss the inclusion of the HF into this 
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framework. In section 5, we benchmark the block-//-SSOR preconditioncr on 
SWA — for several vahics of Csw — in comparison with odd-even precondition- 
ing. Our testbed is a set of quenched configurations on lattices of size 16'' at 
13 — 6.0 and our implementation machines are a 32- node APEIOO/Quadrics 
and a SUN Ultra workstation. Using the HF on a quenched 8^ system, we 
compare the SSOR preconditioned version with an non-preconditioned one, 
for a variety of mass parameters, again at (3 — 6.0. 



2 Improved Fermionic Actions 



In this section we briefly review the basics of SWA and TPA. To flx our 
notation, we write the fermionic lattice action as 

SF = J2^,Mix,y)<iIy, (2) 
where M is the fermion matrix. 



2. 1 Sheikholeslami- Wohlert Action 



For the Wilson fermion action (with Wilson parameter r — 1), supplemented 
by the Sheikholeslami- Wohlert term, we have 



Msw{x,y) = 



Jx,y 



(3) 



where n is the standard Wilson hopping parameter, csw is a parameter that 
can be tuned to optimize 0{a) cancellations, and /2 is a unit vector. 

The 'local' clover term consists of 12 x 12 diagonal blocks. Its exphcit structure 
in Dirac space is given by 



^ Fi F-2 -F3 F4 ^ 

F3 F4 Fl F2 
\Fl-F^Fl-F^j 



(4) 
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with the entries Fi being complex 3x3 matrices, which are shorthands for 
hnear combinations of the F^,^, 

= —Fi2, F2 = —F23 — iFi3, 

(5) 

-^3 = -^34, F4 — Fi4 — iF24- 

Ff^i, is defined by 

where the clover term f^^, reads 

UA^) U,{x)UAx + jl)Ul{x + v)Ul{x) 

+ UAx)Ul{x - /i + u)Ul{x - fi)U^{x - /i) 

+ Ul{x - fi)Ul{x - fi- 0)UAx - fi- i')Uy{x - u) 

+ Ul{x- v)U^{x - u)UAx + 11- p)Ul{x). 



The Wilson-Sheikholeslami-Wohlert matrix exhibits the well known 75 sym- 
metry 

l^Mswl^^Mly^r, (7) 
with the eigenvalues of Msw coming in complex-conjugate pairs. 

2.2 Hypercube Fermions 

The physical properties of a given lattice action remain unaltered under a block 
variable renormalization group transformation (RGT). As a simple example, 
we can divide the (infinite) lattice into disjoint hypercubic blocks of n*^ sites 
each and introduce new variables living on the centers of these blocks (block 
factor n RGT). Then the RGT relates 

~ E (8) 

where and (j)' represent the original and the new lattice fields, respectively. 
The points a; G Z"' are the sites of the original lattice and x' are those of the 
new lattice with spacing n. x & x' means that the point x belongs to the block 
with center x' . 
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Now the original action S[(l)] transforms into a new action S'[(f)'] on the coarse 
lattice. The latter is determined by the functional integral 

e-^'[^'I = |z}0i^[0',0]e-^['^l. (9) 

The kernel K[(j)', 0] has to be chosen such that the partition function and all 
expectation values remain invariant under the RGT. At the end, one usually 
rescales the lattice spacing back to 1. In any case, the correlation length in 
lattice units gets divided by n. 

For the kernel functional there are many possible choices [18]. A particularly 
simple choice for the kernel functional is 

kw,<p]=usU,-^:j:^^ ■ (10) 

x' ^ ^ x&x' ^ 

Assume that we are on a "critical surface", where the correlation length is 
infinite. With a suitably chosen renormalization factor (3n we obtain — after an 
infinite number of RGT iterations — a finite fixed point action (FPA) S* [0] . An 
FPA is invariant under the RGT. The task of /9„ is the neutralization of the 
rescaling of the field 0' at the end. The FPA is an example of a perfect lattice 
action since it is insensitive to a change of the lattice spacing. 

Eq. (10) can be generalized, for instance, to a Gaussian form of blocking. For 
free fermions, a generalization of this type reads (we ignore constant factors 
in the partition function) 



X 



exp { - 1 x: [^r;, - [^;, _ _1_ ^^.] j. (n) 



x& 



Here we have already inserted the suitable parameter /3„ and we introduce 
a new RGT parameter a, which is arbitrary. The critical surface requires a 
fermion mass m = 0, but we can generalize the consideration to a finite mass. 

Assume that we want to perform a number N of RGT iterations. If we start 
from a small mass m/ (nN), then the final mass will be m. In the limit uN — > 
oo we obtain a perfect action at finite mass. In this context, "perfect" means 
that scaling quantities do not depend on the lattice spacing, hence they are 
identical to the continuum values. 

For the above transformation (11), this perfect action can be computed ana- 
lytically in momentum space [19]. The computation simplifies if we let n oo, 
so that iV = 1 is sufficient. Hence we start from the continuum action now, 
and the perfect action takes the form 
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where A* is the perfect propagator. The same perfect action is obtained start- 
ing from a variety of lattice actions, in particular from the Wilson fermion 
action. 

In coordinate space we write this action as 

= ^^,[p^(r)7^ + A(r)]*,+,. (13) 

x,r 

For a > 0, where the RGT breaks the chiral symmetry explicitly, the couphngs 
in and A decay exponentially as |r| increases. An exception is the case d — 1, 
where they are confined to one lattice spacing for the special choice 



a = ^ . (14) 

It turns out that for this choice of a the locality is also excellent in higher 
dimensions, i.e., the exponential decay of the couplings is very fast. This is im- 
portant, because for practical purposes the couplings have to be truncated to 
a short range, and the truncation should not distort the perfect properties too 
much. An elegant truncation scheme uses periodic boundary conditions over 
3 lattice spacings and thus confines the couplings to a unit hypercube. This 
yields the HF, with spectral and thermodynamic properties still drastically 
improved compared to Wilson fermions [15,20]. 

Of course, it is far more difficult to construct an approximately perfect action 
for a complicated interacting theory like QCD. However, as a simple ansatz we 
can just use HF together with the standard gauge link variables. Apart from 
nearest neighbors, we also have couplings over 2, 3 and 4-space diagonals in 
the unit hypercube. We connect all these coupled sites by all possible shortest 
lattice paths, by multiplying the compact gauge fields on the path links. We 
call this procedure "gauging the HF by hand" . Note that one can connect two 
sites X and y lying on 2, 3, and 4-space diagonals via d\ such shortest lattice 
paths. We average over all of them to construct the hyper-hnk, see Fig. 1. 



Let us identify the hyper-link Uj^^x) between site x and x + fi with 

r(2) 



and let us denote the hyper- link in plane, cube and hyper-cube as U^a+vi"^) 



t/^%+p(a;), and U^jflyj^pj^fj{x), respectively. Then we can write the correspond- 
ing fermion matrix in terms of the hyper-links which can be constructed re- 
cursively starting from the gauge links 
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U^'^^ (x) - - 



+ . . . 



(15) 







(2) 




U(x)/ 




li+v'/ 


^U(x) 
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(x-Hi) 



(x) 



Fig. 1. 1-space, 2-space and 3-space hyper-links. 



It is convenient to introduce pre-factors which are functions of the HF hopping 
parameters Ki and Aj, i = 1, . . . , 4, and sums of 7-matrices: 



r±M±L.±p = A3 + K3(±7m ± 7/. ± 7p) 
r±M±t/±p±(T = A4 + /«4(±7// ± 7i/ ± 7p ± 7(t) ■ 



(16) 



Note that the A, in eq. (16) differ from A(r) in eq. (13) by a normahzation 

Ao' 



factor The arise from p^(r) by the same normahzation. 
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The corresponding HF matrix is defined by 



MHF{x,y) = Ao< S^^y 



+E 



x,y+fi 



r 



(2) 
(2) 



/i V>flp>U 



''f^+H—v+p Ui^_i,j^p{x)5x,y-fj,+i/—p 
—P-+V+P ^-p.+u+p{x)^x,y+[i-i>~p 
{x)^x,y+[i+i>-p 



r(3) 



'^^—fi—i'—p U—ix—v+p 



+ r 



u, 



(3) 



(^)<^a;,j/-/i-i>+p 
~l~ ^ +p,—v—pU p^—i,^p(^x)5x,y—[j,+v+p 



+IJ-+iy-p'~^ fl+U-p 

r(3) 



^7(3) 



{x)^x, 



.y+p.-i>+p 



~l~ ^—p-~i'—pU—fj,—i/—pi'^)^x,y+p,+i>+p 



U^+u+p+ai^)^: 



+EE E E 

p, v>pp>va>p 

r 

+p+v-p+a 
'^'^ +p-u+p+<j 

+p-i'-p+<T U ^_y_pj^(^{x)8x^y-p+u+p—a ~l~ 
~ix+v+p+cr 
-H+v—p+a 



u, 



x,y-p-iy-p 

p+i/—p+aK'^)'-'x,y—fl — u+p—a- 
(4) 



p—i/+p+ai-^)8x,y—fl+u—p—(T ~r J- 



' r 

r. 



crU^^i,^p—u{x)Sx^y—fi—Ci—p+a 
-a-U ^^y—p—u{x)6x^y—p—ii+p+a 
■crU p__yj^p_„(yX^5x^y—p+u—p+a- 

(jU p^-i,-p—cr{,-^^^x,y—jX+i'+p-\-a 

-p+i'+p—a{,'^^^x,y+ii—ii—p+a 
■(TU-p-\-u—p—r7{-^)^x,y+[i—i>+p+a 
■aU —p^—yj\-p—(j (yX^5x,y+p+v—p+a 
-a-U —p_—i,—p—cr{x)^x,y+ii+ii+p+a 



u. 



(4) 



p+v+p+a{.x)^x,y+ii-v-p—cr^ ^ -p+v+p- 
^—p+v—p+a {.■^^^x^y+p—O+p—a^ ^ —p+v—p- 
—p—v+p+cr U_p_^i,j^pj^^{x)5x,y+p+v—p—cr\' ^—p—u+p- 
^—p—i'—p+a{-^)^x,y+p,+i>+p—a'^ ^—p—u—p- 



—p—u—p+o 



(17) 



The sums in eq. (17) run over four different directions for two 1-space links, six 
directions for four 2-space hnks, four directions for eight 3-space hnks, and one 
direction for the sixteen 4-space hnks. Altogether 80 hyper-links contribute. 

With each path of the free HF, a 7-matrix is associated. We have chosen the 
7-matrices so that they add up to produce an effective F, see eq. (16), which 
is associated with a given hyper-link. 
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The 1-space links U^\x) are identical to the hermitean-conjugate links in 
negative direction, U'^S^ {x + (1) , and this feature also holds for the hyper-links, 
e.g., 

Ul^^l^^+...+^^{x) = uL%_^,^^...^^^{x + /ii + /i2 + ■ ■ ■ + fid)- (17) 

Therefore, only one half of the hyper-links have to be stored in the implemen- 
tation of the HF. 

As in case of Wilson fermions, the HF matrix exhibits the "75 symmetry" , 

75 Mhf15 = Mj^p, (18) 

i.e., Mhf is non-hermitean but its eigenvalues come in complex-conjugate 
pairs0. 

The off-diagonal elements ("hopping parameters") Hi and Aj are shown as 
functions of the mass m in Fig. 2. 




-1 1 2 3 4 

Fig. 2. The vector "hopping parameters' 
Xi as functions of the mass m. 




Ki, and the scalar "hopping parameters' 



3 Block SSOR Preconditioning 



Preconditioning a linear system 

MV' = </) (19) 

amounts to selecting two regular matrices Vi and V2, which act as a left and a 
right preconditioner, respectively. This means that we consider the modified 

^ Using such a fermionic action at m = 0, one obtains for instance a strongly 
improved meson dispersion relation [15]. For the inclusion of a truncated perfect 
quark gluon vertex function, see Ref. [16]. 
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system[^ 

V^^^MVf^i) = 0, where := V, ^ := (20) 

The spectral properties of the preconditioned matrix V^^ MV^^^ depend only 
on the product V1V2, but not on the manner how it is factorised into Vi and V2. 
For a good choice, the number of iteration steps required for solving eq. (20) 
by Krylov subspace methods (such as BiCGStab) can be reduced significantly, 
compared to the original system (19). 

In this paper, we consider block SSOR preconditioning. Let M = D — L — U 
be the decomposition of M into its block diagonal part D, its (block) lower 
triangular part —L and its (block) upper triangular part —U. Given a relax- 
ation parameter u ^ 0, block SSOR preconditioning is then defined through 
the choice 

V, = (-D - l) (-d) \ V2 = -D-U . (21) 

It is important that block SSOR preconditioning is particularly cheap in terms 
of arithmetic costs due to the Eisenstat trick [21], which is based on the identity 



UJ 



-D -D-L 



UJ 



-1 



-D 



UJ 



(1 



(D-L-U) (-D 

\UJ 

-1 



U 



-1 



f22) 



-D-L 



UJ 



ujLD- 



1 + 

1 + (UJ 



UJ 



-D 



UJ 



-D 



UJ 



2) 1 - ujUD- 



U 

-1 



-1' 



D-U 



(1 



UJ 



ujUD- 



Therefore, the matrix-vector product with the preconditioned matrix, z = 
V^^MV^f^x, can be computed according to Algorithm 1. 

solve (1 — ujUD^^)y = x 
compute w = X + {uj — 2)y 
solve (1 — ujLD^^)v = w 
compute z = V + y 

Alg. 1 Matrix-vector product for the preconditioned system. 

Here the matrices 1 — ujUD~^ and 1 — ujLD~^ are (non-block) lower and up- 
per triangular, respectively. This means that solving the corresponding linear 
systems amounts to a simple forward and backward substitution process. Al- 
gorithm 2 gives a detailed description of the forward substitution for solving 
(1 — ujLD^^)v = w. Here we denote the block components of the vectors v, w 

^ We mark preconditioned quantities by a tilde. 
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and the matrices D,L as Vi,Wi and Da, Lij. The backward substitution for 
{1 — LuU D~^)y — X can be carried out analogously. 



Assuming that the blocks Da of D are dense and that their inverses have been 
pre-computed, we see that one iteration step in the above algorithm is exactly 
as expensive as a direct multiplication with the matrix D — L (except for the 
additional multiplication with cu). A similar relation holds for the backward 
substitution and the multiplication with D — U. Note that the two multipli- 
cations with D — L and D — U are as expensive as one multiplication with the 
whole matrix M plus one additional multiplication with D. This allows us to 
quantify exactly the work required when using block SSOR preconditioning 
with the Eisenstat trick. 

• Initialization: the inverses D^^^ of all diagonal blocks of the block diagonal 
matrix D must be pre-computed before the iteration starts. We also assume 
that these inverses are already scaled by the factor uj in the initiahzation. 

• Iteration: each iterative step requires additional arithmetic work equivalent 
to one matrix- vector multiplication with the matrix D plus one vector scal- 
ing (with factor lu — 2) and two vector summations. 

In SU(3) lattice gauge theory, a natural choice for the block diagonal matrix D 
takes each block Da to be of dimension 12, corresponding to the 12 variables 

residing at each lattice point. In this work, we will consider this choice, denoted 
as D'^^'^\ as well as the three generic options D^^\ D^^^ and D^^\ where the 
diagonal blocks are of dimension 6, 3 and 1, respectively. The choices D^^^ 
and D^^^ also appear 'natural' — at least within the SWA framework — since a 
diagonal block of D^^'^^ carries the structure of eq. (4). Accordingly, ignoring 
the parameters k and Cswj four consecutive 3x3 blocks in D^^^ are given by 
1 + Fi, 1 — Fi, 1 + Fi, 1 — Fi and two consecutive 6x6 diagonal blocks in 
D^*^^ are identical and given by 



Table 1 quantifies the arithmetic effort for computing a matrix-vector product 
with each of the matrices D'^^'\ D^'^^ and D^-^\ We count this effort in 

units of cflops, which represent one multiplication of complex numbers followed 
by one summation. The table also quotes estimates for the arithmetic work 
to compute the inverse of each of these matrices in units of matrix-vector 



for z = 1 to 

Si = ujD'i^Vi 



Alg. 2 Forward substitution. 




(23) 
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Tabic 1 



Arithmetic cost for different sizes of tlie diagonal blocks. V is the lattice volume. 







L>(6) 


L>(3) 




MVM (in cflops) 


lUV (27.2%) 


72V (13.6%) 


36V (6.8%) 


V (0.2%) 


inverse (in MVM) 


10 


2.5 


2 


1 



multiplies (MVM). Precise numbers will depend on the particular algorithm 
chosen for the inversion. The estimates in Table 1 are based on a particularly 
efficient way for computing the inverse, which uses Cramer's rule on 3 x 3 
blocks and which exploits the additional block structure within each of the 
i^!?andDr- 



The percentages given in brackets quantify these numbers in terms of the 
cost for a single matrix-vector multiply with M. Referring to our previous 
discussion, they specify the additional cost for (block) SSOR preconditioning. 
V denotes the lattice volume and one matrix- vector multiplication with M is 
counted with 5281^ cflops. 



4 Pcirallelization 

In the fermion equation (19), we have the freedom to choose any ordering 
for the lattice points x. Different orderings yield different matrices M, which 
are permutationally similar to each other. One matrix can be retrieved from 
the other one by the transformation M — > P'^MP, where P is a permutation 
matrix. In general, the quality of the block SSOR preconditioner depends on 
the ordering scheme. 

On the other hand, the ordering of the lattice points also determines the degree 
of parallelism within the forward (and backward) substitutions as described 
in Algorithm 3. Usually, there is a trade-off between the parallehsm a given 
ordering allows for, and the efficiency of the corresponding SSOR precondi- 
tioning. 

In an earher paper [8], we have shown that for the non- block SSOR precon- 
ditioner and the Wilson fermion matrix one can use a locally lexicographic 
ordering on parallel computers supporting grid topologies, so that the result- 
ing SSOR preconditioner parallelizes nicely while significantly outperforming 
the standard odd-even preconditioner. The purpose of this section is to show 
that this is also possible for the block SSOR preconditioners considered here, 
even for situations where M represents couplings beyond nearest-neighbor 
lattice points. 

Let n{x) denote the set of all lattice points a given site x is coupled to. For 
example, n{x) = {y \ y — x ± p,, /j, — 1,... ,4} for the nearest-neighbor 
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coupling, n{x) — {y ^ x \ — < 2} for nearest and ncxt-to-nearest 

neighbor couplings, or n{x) — {y ^ x \ — < 1, = 1, . . . ,4} for the 
hyper cube couplings. 

We now re-formulate the forward substitution of Algorithm 2 for this general- 
ized situation. Wc assume an overall natural partitioning of M into sub-blocks 
of dimension 12 (corresponding to the 12 variables at a given lattice point), 
and we use the lattice positions x, y as indices for those blocks. By D we de- 
note any of the matrices D^^'^\ . . . , D'^^\ so D^x stands for a diagonal block of 
dimension 12 x 12. It is fully occupied in the case D — D^^'^\ whereas in case 
D = D^^^ it consists of two 6x6 diagonal blocks, etc. We also use the relation 
X <:o y between lattice points to denote that x has been numbered before y 
with respect to a given ordering o. 

for all lattice positions x in a given ordering o 

Vx — Wx + 'Yliy^n{x), y<oX ^xySy 

Sx = ujD-^Vx 
Alg. 3 Generalized forward substitution. 

To discuss parallelization, we use the concept of coloring the lattice points. 
A decomposition of all lattice points into mutually disjoint sets Ci, . . . , is 
termed a coloring (with respect to the matrix M), if for any I e {1, ■ ■ ■ ,k} 
the property 

X & Ci ^ y ^ Ci for all y e n{x) 

holds. Associating a different color with each set Ci, this property means 
that each lattice point couples with lattice points of different colors only. An 
associated color ordering first numbers all grid points with color Ci, then all 
with C2 etc. With such a color ordering we see that the computation of Vx for 
all a; of a given color Ci can be done in parallel, since terms like J2yen{x), y<ox 
involve only lattice points from the preceding colors Ci, . . . , C/_i. 

In the case of nearest-neighbor couplings, the familiar odd-even ordering rep- 
resents such a coloring with two colors corresponding to the odd and the even 
sublatticc. For the case of the Wilson fermion matrix, we pointed out in [8] 
that the (non-block) SSOR preconditioned system may be interpreted as a 
representation of the familiar odd-even reduction process. A similar relation 
arises in the case of SWA, where the reduced system considered in Ref. [22] is 
equivalent to the (12 x 12 block) SSOR preconditioned matrix with odd-even 
ordering. 

For more complicated couplings like the next-to-nearest neighbor couplings 
or the HF, it would become increasingly difficult to handle colorings with a 
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minimum number of colors. For example, the hypercube ordering requires at 
least 16 different colors. 

However, aiming at a minimal number of colors is not a good strategy. For 
example, the odd-even coloring can actually be considered as the 'worst case' 
as far as the quality of the corresponding SSOR preconditioner is concerned 
[8]. Heuristically, this can be explained as follows: if the number of colors is 
small, the color sets Ci themselves are large, and information is not spread 
between lattice points of equal color in a forward (or backward) substitution. 

Therefore, the right strategy is to search for colorings such that the number 
k of colors is maximal^ while the number of points within each color is still in 
agreement with the parallelization we are aiming for. 

The locally lexicographic ordering, proposed in Ref . [8] for the case of a nearest- 
neighbor coupling, turns out to be an adequate ordering also for more com- 
plicated couplings like next-to-nearest neighbor and hypercube. 

To describe this ordering, we assume the processors of the parallel computer to 
be connected as a 4-dimensional grid Pi x p2 x Ps x P4- Note that this includes 
lower dimensional grids by setting some of the pj to 1. The space-time lattice 
can be matched to the processor grid in a natural manner, producing a local 
lattice of size n^°^ x n^2'^ x n^^'^ x n^^^ with n^°^ = rii/pi on each processor. Here 
we assume for simplicity that each pi divides nj, and that we have n''°'^ > 2 for 
i = l,... ,4. 

Let us divide the lattice sites into n'"^ groups where n''"'^ = n^°'^n^2^n^.^'^n^^'^ . 
Each group corresponds to a fixed position within the local grids and contains 
all grid points appearing at this position within their respective local grid. 
Associating a color with each of the groups, we get a coloring in the sense of 
the definition above, as long as the coupling defined through M is local enough. 
More precisely, the sets represent a coloring, if for all y G n{x) the relation 
\y^l — < n^°'^ holds for = 1, . . . , 4. For example, we need n^°'^ > 2 for all 
II for the hypercube couplings and n''°'^ > 3 for all /x for the next-to-nearest 
neighbor couplings. 

A locally lexicographic {II) ordering is now defined to be the color ordering, 
where all points of a given color are ordered after all points with colors, which 
correspond to lattice positions on the local grid that are lexicographically 
preceding the given color. In Fig. 3, this amounts to the alphabetic ordering of 
the colors a - q. This example also illustrates the decoupling obtained through 
that ordering for (2-dimensional) nearest-neighbor and hypercube couplings. 

The parallel version of the forward substitution in Algorithm 4 with the II- 
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Fig. 3. Locally lexicographic ordering and forward solve in 2 dimensions. Near- 
est-neighbor: straight arrows, hyper-cube couplings: straight and thin diagonal ar- 
rows. 

ordering now reads: 

for all colors in lexicographic order 
for all processors 

X :— grid point of the given color on that processor 

f^x — '^x + ^yen{x), y<iix ^xy^y 

Alg. 4 ZZ-forward substitution. 

If the lattice point x is close to the boundary of the local lattice, then the set 
n{x) will contain grid points y residing in neighboring processors. Therefore, 
some of the quantities Sy will have to be communicated from those neighbor- 
ing processors. The detailed communication scheme for the case of a nearest- 
neighbor coupling was given in Rcf. [8]. In that case, only the 8 nearest neigh- 
bors in the processor grid were involved in the communication. For the more 
complicated HF couplings, all 80 hypercube neighbors may be involved. For a 
3- or 2-dimensional processor grid, this number reduces to 26 resp. 8. 
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5 Results 



The //-SSOR preconditioning of improved actions has been tested in quenched 
QCD, for reahstic lattice sizes and parameters. For SWA we use the odd- 
even preconditioner of Ref. [22] as reference. The HF action preconditioner 
has been implemented only on a scalar machine so far. Work for a parallel 
implementation is in progress. 

5. 1 Sheikholeslami- Wohlert Action 

We are going to compare results from test runs of //-SSOR and odd-even 
preconditioning, both codes being equally well optimized for the multiplication 
of the SWA fermion matrix. Our investigations are based on a de-correlated 
set of 10 quenched gauge configurations generated on a 16^ lattice at /? = 6.0. 
We have taken measurements at 3 values of csw, 0, 1.0 and 1.769. The latter 
value is the optimal quenched csw coefficient taken from Ref. [12]. 

In order to provide both machine independent numbers and real time results on 
parallel and scalar implementation machines, we will present iteration numbers 
which (i) are directly proportional to the amount of floating point operations 
and (a) real time results from implementations on both the parallel system 
APEIOO/Quadrics and a SUN Ultra workstation. 

We have applied BiCGStab as iterative solver. The stopping criterion has 
been chosen as r = ^^^^ji*^^^ < 10~^. We used a local source. At the end of 
the computation, we have checked how far the true residuum deviates from 
the accumulated one. In fact, for //-SSOR, the accumulated residuum turned 
out to deviate only slightly from the corresponding true residuum. Moreover, 
deviations between the solutions X as computed by //-SSOR and odd-even- 
preconditioning have been checked. We found the norms of the solution vectors 
to differ in the range of 10~^. 

In a first step, we have determined the optimal value for the over-relaxation 
parameter u as introduced in eq. (21), see Fig. 4. The u dependence of the 
iteration numbers is measured for csw = 1-769 at a given value for the hopping 
parameter k = 0.1333. At this value, the fermion matrix is close to criticality. 

In Fig. 4, the results from three diagonal block sizes are overlaid, the 1x1, 
3x3, and 6x6 blocks. Only a weak dependence of the iteration numbers 
on the block size is visible, however. Around = 1.4 a minimum in iteration 
numbers is foundQ. We have verified that this number holds for the whole 

Ref. [8] considered only the case w = 1 for csw = 0- 
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Fig. 4. cj-dependence of ZZ-SSOR. We plot the gain factor between ZZ-SSOR and 
odd-even preconditioning. 

K-range investigated and for the other values of csw as well. 

Next we benchmark the //-SSOR preconditioncr against the odd-even precon- 
ditioner. In Fig. 5, the iteration numbers are presented as a function of k, 
separately for the three values chosen for csw- We plot the ratio of iteration 
numbers of the odd-even procedure vs. ZZ-SSOR. In the last two segments of 
the figure, three block sizes are overlaid. 

The improvement of ZZ-SSOR compared to the odd-even preconditioned system 
is rather substantial: close to rric, i.e. in the region of interest, a factor up to 2.5 
in iteration numbers can be found, with increasing tendency when approaching 
Kc- As far as the dependence of the improvement factor on csw is concerned, 
one cannot detect a systematic effect. Significant block size dependencies are 
not visible either. However, in the actual time measurements on APEIOO to be 
shown below, we will find the 3x3 local diagonal block procedure to perform 
best. 

The above results have been achieved on a machine equipped with p = 32 
processors. With V — 65536 being the number of sites on the lattice the 
sub-lattices comprise 2048 sites each. As the regions which are treated inde- 
pendently in the preconditioning process are as large as the size of a sub-lattice 
assigned to a given processor, the parallelism of the preconditioncr follows the 
number of processors p. The larger the sub-lattices the better is the improve- 
ment factor [8] , since the apphcability of the ZZ-SSOR preconditioncr seems to 
be limited to low granularity. However, it turns out that on today's machine 
sizes — ranging from coarse parallelism with (9(10) processors to massive par- 
allelism with 0(1000) processors — usual lattice sizes lead to sufficiently large 
sub-lattices to ascertain effectively parallel preconditioning. For the 3 x 3 di- 
agonal block ZZ-SSOR procedure, we have investigated the local lattice size 
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Fig. 5. K dependence of ZZ-SSOR vs. the odd-even preconditioner for three values 
of csw- 
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Fig. 6. Local lattice size dependence of //-SSOR preconditioning as determined for 
a fixed size system on 32, 128, 256 and 512 node APEIOO/Quadrics machines. 

dependence working on four different APEIOO/Quadrics systems [23], a 32- 
node Q4, a 128-node QHl, a 256-node QH2, and a 512-node QH4. For a given 
lattice size, the sub-lattice sizes follow the inverse number of nodes. In the 
range investigated, the improvement varies by about 10 %. 

Going to the most effective p = 1 limit, in fact, the //-SSOR preconditioner 
is identical to the SSOR preconditioner which, for = 1, is equivalent|3| to 
an incomplete LU preconditioning, introduced by Oyanagi [24]. In Fig. 7, we 
present the ensuing improvement factor for the iteration numbers as a function 
of K for the three values of csw, again plotting the gain factor between //-SSOR 
and odd-even results {u = 1.4). 

As already demonstrated, the gain-factor of SSOR compared to the odd-even 
preconditioned system is larger here than for parallel //-SSOR. Close to Kc, we 
can verify a factor of up to 3 for iteration numbers with increasing tendency 
going towards Kc- Thus, the improvement factors reported in Refs. [24] and 
[25] for standard Wilson fermions are confirmed for SWA. Again, as to the 
dependence of the improvement factor on csw, one cannot find a significant 
variation. 

Finally, on the APEIOO/Quadrics parallel system [23], we have implemented 
and optimized both //-SSOR and odd-even preconditioners in order to compare 
real costs. 

Following the above results, we measured at = 1.4, k = 0.1333 and csw = 
1.769. In the case of //-SSOR we applied the local diagonal block procedure for 
1x1,3x3, and 6x6 blocks. Additionally, as for the odd-even preconditioner, 

^ For Wilson parameter r = 1. 
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Fig. 7. K dependence of SSOR peconditioning vs. odd-even preconditioning for 
three values of csw- 
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Fig. 8. Time measurements for the ZZ-SSOR preconditioner on a 32-node 
APEIOO/Quadrics for SWA. Four different local diagonal blocking methods are 
benchmarked. 
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Fig. 9. Time measurements for the SSOR preconditioner compared to odd-even pre- 
conditioning on a SUN Ultra workstation. Again, four different local diagonal block 
size methods are investigated together with the odd-even preconditioned system. 

we performed a pre-inversion of the local 12 x 12 blocks. However, we remark 
that these blocks require a memory expense of a factor of 9 compared to the 
non-blocked version. 

Although one does not achieve an improvement in iteration numbers between 

1x1 and 12 x 12 blocks, it turns out to be advantageous to choose a specific 
block size for a given implementation machine. For APEIOO, the optimal block 
size is a 3 X 3 block. The results are plotted in Fig. 8. The corresponding results 
as achieved on a SUN Ultra are given in Fig. 9. 
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Fig. 10. Determination of rric by use of CG inversions. 

On the scalar system, the gain factor in iteration numbers fully pays off for the 
diagonal 1x1 algorithm. It is not surprising that the gain becomes smaller for 
the larger diagonal blocks since larger blocks means more compute operations. 
The apphcation of the SSOR principle to solve the system — including the non- 
trivial diagonal — completely and thus avoiding its explicit pre-inversion, turns 
out to be most effective. We note that on scalar machines the gain in iteration 
numbers fully translates into a gain in compute time. On APEIOO/Quadrics, 
the improvement gain is deteriorated by intensive integer operations, a weak 
point of APEIOO. 



5.2 Hypercube Fermions 

The HF action has been rendered gauge invariant 'by hand' similar to the 
procedures in Rcf. [15-17]. As such it is, strictly speaking, not even truncated 
perfect, however, its coupling structure is similar to the structure of the latter. 

So far we have implemented and tested SSOR preconditioning for HF on a 
scalar machine. We have already mentioned that the number of SU(3) matrices 

per site to be stored is increased by a factor of 5 compared to clover fermions. 
Limited by the number of hyper-links to store, we decided to investigate HF 
on a lattice of size 8^. Our implementation on a SUN Ultra has been written 
in FortranQO. 

We measured at /? = 6.0 in quenched QCD. First, we tried to assess the critical 
mass parameter, in order to determine the critical region of HF. We used a 
method introduced in Ref. [18] which makes use of the dependence of CG 
iterations on the condition number of the matrix, cf. Fig. 10. 
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Fig. 11. Dependence of the iteration numbers on the over-relaxation parameter lo. 



Close to the critical mass rric we have assessed the optimal over-relaxation 
parameter u; for the SSOR method, see Fig. 11. Here and in the following we 
compare the result with the unpreconditioned BiCGStab solution since simple 
odd-even preconditioning cannot be applied in the case of HF. As a result, we 
chose u = 1.0. Fig. 12 presents our findings for the iteration numbers as a 
function of m. 

So far, we conclude that SSOR preconditioning of the HF action, gauged by 
hand, leads to gain factors up to 4 close to the critical mass parameter nic- 

However, we regard the specific form of HF as a preliminary test case only 
since a consistent derivation of TPA includes clover-leaf like terms. 

In view of the tremendous compute effort, exceeding that of Wilson fermions 
by a factor of more than 10, we regard preconditioning as a mandatory prereq- 
uisite for HF to become competitive with traditional fermion discretizations. 
This conclusion applies equally well to other variants of HF. 

As far as the storage requirements are concerned, again a factor of 10 is found 
which cannot be avoided. For a given memory limit, this translates into a factor 
of 0.56 in linear lattice size or 1.8 in scale compared to Wilson fermions. Thus, 
from a technical point of view, HF would in principle pay off if their reduced 
scaling violations allow coarsening by a factor of 1.8. This is a question for 
future investigations where we will include a perfect truncated gluonic action 
together with a full-fledged TPA. 



24 



5 




' ' I ' 

-1 -0.5 0.5 1 1.5 2 
m 

4 |. . , I 




' ' I ' 

-1 -0.5 0.5 1 1.5 2 
m 



Fig. 12. Dependence of the solution on the mass parameter m. We show both (a) the 
iteration numbers of the non-preconditioned case vs. the ZZ-SSOR preconditioned 
results and (b) the corresponding gain factor in time. 

6 Conclusions 

We have constructed locally-lexicographic SSOR preconditioners for inversions 
of linear systems of equations from two improved fermionic actions, the Sheik- 
holeslami-Wohlert- Wilson scheme with non-constant block-diagonal and re- 
normalization group improved hypercube fermions, with interaction gauged 
ad hoc. 

For SWA we find the block ZZ-SSOR-scheme to be more effective by factors up 
to 2.5 compared to odd-even preconditioned solvers. For HF we have demon- 
strated that SSOR preconditioning accelerates the iterative solver by a factor 
of 3 to 4 compared to the non-preconditioned system. We believe that the im- 
provement for HF will translate also into other TPA with interaction derived 
from renormalization group transformations. 
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